- Articulate the principles of effective data visualisation
- Understand the foundational place of the ‘big 5’ graphics for datavis
- Use ggplot2 to create and modify common forms of data visualisation to publication standard
Problems typically fall into:
The grammar of graphics
Languages are comprised of different elements: nouns, verbs, articles, subjects, objects, etc.
The rules defining their arrangement into a meaningful whole define the grammar of the language
The grammar of graphics works similarly, by defining a set of rules for constructing visualisations by combining different types of layers (Wilkinson 2005)
In short, the grammar tells us that:
A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.
ggplot2 implements this grammar in R, breaking graphics into three essential components:
- data: the dataset containing the variables of interest.
- geom: the geometric object in question, i.e. the type of object we can observe in a plot, such as points, lines, and bars.
- aes: the aesthetic attributes of the geometric object, i.e. their appearance: colour, shape, size, position. Aesthetic attributes are mapped to variables in the data.
The basic template:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
<GEOMS>() +
<CUSTOMISATIONS>
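As a rough illustration of the template, here is a minimal sketch that fills in each slot using the built-in iris dataset (an assumption made for this example; it is not one of the datasets used in these slides). It maps a third variable to the colour aesthetic to show what "mapping data variables to aesthetic attributes" looks like in practice.

```r
library(ggplot2)

# <DATA> = iris, <MAPPINGS> = x, y and colour, <GEOMS> = geom_point
iris_plot <-
  ggplot(data = iris,
         mapping = aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +      # draw one point per row of the data
  theme_classic()     # <CUSTOMISATIONS>: a plainer background

iris_plot
```

Because Species is mapped inside aes(), ggplot2 chooses one colour per species and adds a legend automatically.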
We use the ggplot() function to bind the plot to a specific data frame using the data argument
ggplot(data = dat_eco_clean)
Note the plot is just a large grey blob, because we’ve only given it the data (& haven’t told it how to use it).
We then extend it to define an aesthetic mapping using the aes function
ggplot(data = dat_eco_clean, mapping = aes(x = weight, y = hindfoot_length))
Note we have a little more structure here. We’ve told it what, but not how.
Finally (well, not really) we add a geom – our graphical representation of the data in the plot. Here we use geom_point() to create a scatterplot (i.e. a plot of points), which is sensible for the data at hand. And…
ggplot(data = dat_eco_clean, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point()
We have a (fairly ugly) plot!
We can then customise the heck out of it, in line with our datavis principles.
ggplot(data = dat_eco_clean, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1) +
ylab('Hindfoot length (cm)') +
xlab('Weight (kg)') +
theme_classic()
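The same kind of plot can also be built up step by step by storing it in an object and adding layers to it. The sketch below assumes the built-in mtcars dataset as a stand-in (dat_eco_clean is not available here), and uses labs() to set both axis labels in one call.

```r
library(ggplot2)

# Step 1: data and aesthetic mapping only (an empty canvas with axes)
base_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Step 2: add the geom
scatter <- base_plot + geom_point(alpha = 0.6)

# Step 3: add labels and a cleaner theme
polished <- scatter +
  labs(x = "Weight (1000 lbs)", y = "Fuel economy (miles per gallon)") +
  theme_classic()

polished
```

Each intermediate object is itself a complete plot, so any of the three can be printed to inspect it.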
- Add each new layer with + at the end of a line
- Save a plot to an object with <-
The big five graphics
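The five graphics and their ggplot2 geoms:
- Scatterplots: geom_point()
- Linegraphs: geom_line()
- Boxplots: geom_boxplot()
- Histograms: geom_histogram()
- Barplots: geom_bar() or geom_col()
Scatterplots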
Specified via geom_point().
Let’s use it to look at some data on the relationship between flight departure delays and arrival delays in the US:
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point()
Let’s take a moment to clean things up & learn some new tricks. Specifically, let’s deal with the overplotting at 0, 0. First: we can try changing the transparency of points.
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2)
Note that transparency is cumulative: overlapping points appear darker, so the shading adds information about where the data are densest.
Alternatively, we can ‘jitter’ the points to randomly nudge them away from one another & create some space. Let’s replace our geom_point() with geom_jitter().
ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_jitter(width = 30, height = 30)
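To see the overplotting problem in isolation, here is a small sketch using a made-up data frame (delays is hypothetical, not the alaska_flights data) in which 150 observations share only three distinct positions. It also uses position_jitter() with a seed so that the random nudges are reproducible between runs.

```r
library(ggplot2)

# Hypothetical data: 150 rows, but only 3 distinct (x, y) positions
delays <- data.frame(
  dep_delay = rep(c(0, 5, 10), each = 50),
  arr_delay = rep(c(0, 5, 10), each = 50)
)

# Without jitter, the 150 points collapse into 3 visible dots
p_overplotted <- ggplot(delays, aes(x = dep_delay, y = arr_delay)) +
  geom_point()

# With jitter, each point is nudged by a small random amount;
# fixing the seed makes the plot look the same every time
p_jittered <- ggplot(delays, aes(x = dep_delay, y = arr_delay)) +
  geom_point(
    position = position_jitter(width = 1, height = 1, seed = 42),
    alpha = 0.3
  )

p_jittered
```

Jitter changes only where points are drawn, not the underlying data, so the amount of jitter should stay small relative to the scale of the axes.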
Linegraphs
Most commonly used to view temporal relationships (x = minutes, hours, years, etc.).
Specified via geom_line().
Let’s take a look at some data on the growth rates of Loblolly pine trees via the Loblolly dataset.
head(Loblolly)
##    height age Seed
## 1    4.51   3  301
## 15  10.89   5  301
## 29  28.72  10  301
## 43  41.74  15  301
## 57  52.70  20  301
## 71  60.92  25  301
ggplot(data = Loblolly, mapping = aes(x = age, y = height, group = Seed)) +
geom_line()
What if we want to visually separate those trajectories a bit? We could consider using colour instead of group:
ggplot(data = Loblolly, mapping = aes(x = age, y = height, colour = Seed)) +
geom_line()
We could also consider using facet_wrap(), which facets the plot based on a variable that we specify (here, Seed).
ggplot(data = Loblolly, mapping = aes(x = age, y = height)) +
geom_line() +
facet_wrap(~Seed)
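One more option, sketched here as a possibility rather than a recommendation: keep group = Seed for faint individual trajectories and overlay a single summary line. The use of stat_summary() and the linewidth argument (which assumes ggplot2 3.4.0 or later) are additions for illustration.

```r
library(ggplot2)

# Loblolly ships with R: 14 seed sources, each measured at 6 ages
n_trees <- nlevels(Loblolly$Seed)

trend_plot <-
  ggplot(data = Loblolly, mapping = aes(x = age, y = height)) +
  # group = Seed draws one faint line per seed source, with no legend
  geom_line(mapping = aes(group = Seed), alpha = 0.3) +
  # a single thicker line tracing the mean height at each age
  stat_summary(fun = mean, geom = "line", linewidth = 1) +
  theme_classic()

trend_plot
```

Setting group inside the geom_line() layer, rather than in the top-level aes(), means the summary layer is not split by Seed and so produces one overall line.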
- geom_jitter() can replace geom_point() to create some visual separation between points
- The group and colour aesthetic arguments both separate lines, to different ends
- facet_wrap() is a handy layer for splitting up groups into their constituent elements
Boxplots
Let’s build one from scratch using CO2 data:
head(CO2)
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2
ggplot(data = CO2, mapping = aes(x = Treatment, y = uptake)) +
geom_boxplot()
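To connect the boxes to numbers, the sketch below computes the quartiles of uptake for each treatment using base R only. These should correspond closely to the bottom, middle line, and top of each box; the whiskers and outlier points follow a separate rule (by default, up to 1.5 times the interquartile range from the box) and are not computed here.

```r
# Split uptake by treatment, then take the 25th, 50th and 75th percentiles
box_stats <- sapply(
  split(CO2$uptake, CO2$Treatment),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)

# Rows are the quartiles, columns are the two treatments
box_stats
```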
Great! But eh, it’s a bit ugly, and it’s also defying one of our basic principles. Let’s start wandering further down the customisation rabbit-hole.
Let’s try combining multiple geoms for a more information-rich plot.
ggplot(data = CO2, mapping = aes(x = Treatment, y = uptake)) +
geom_boxplot() +
geom_point()
Can we do better?
ggplot(data = CO2, mapping = aes(x = Treatment, y = uptake)) +
geom_boxplot() +
geom_jitter(width = 0.1, alpha = 0.5)
What about those ugly wide boxes?
ggplot(data = CO2, mapping = aes(x = Treatment, y = uptake)) +
geom_boxplot(width = 0.3) +
geom_jitter(width = 0.1, alpha = 0.5)
Nice! And now for another new trick, can we fix that noisy background?
ggplot(data = CO2, mapping = aes(x = Treatment, y = uptake)) +
geom_boxplot(width = 0.3) +
geom_jitter(width = 0.1, alpha = 0.5) +
theme_classic()
There are still two things that bug me about this plot: the y-axis doesn’t start at zero, and the axis label carries no units.
ggplot(data = CO2, mapping = aes(x = Treatment, y = uptake)) +
geom_boxplot(width = 0.3) +
geom_jitter(width = 0.1, alpha = 0.5) +
ylim(0, 50) +
ylab('Uptake (umol/m^2 sec)') +
theme_classic()
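One of the more fine-grained alternatives is coord_cartesian(), sketched below. The difference matters when data fall outside the limits: ylim() discards those observations before the boxplot statistics are computed, whereas coord_cartesian() only zooms the view. With these data everything lies within 0 to 50, so both approaches should give the same picture. The labs() call is an equivalent way to set both axis labels at once.

```r
library(ggplot2)

zoomed_plot <-
  ggplot(data = CO2, mapping = aes(x = Treatment, y = uptake)) +
  geom_boxplot(width = 0.3) +
  geom_jitter(width = 0.1, alpha = 0.5) +
  # zoom the y-axis without dropping any observations
  coord_cartesian(ylim = c(0, 50)) +
  # set both axis labels in a single call
  labs(x = "Treatment", y = "Uptake") +
  theme_classic()

zoomed_plot
```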
- xlim() and ylim() allow simple control of axis limits (there are also more fine-grained methods)
- xlab() and ylab() do the same for axis labels
- Use theme_ functions such as theme_classic() to exercise fine control over the non-data elements on your plot
Histograms
A powerful diagnostic tool, specified via geom_histogram().
Let’s take a look at our CO2 data again. We’ll just begin with one of our treatments to get a feel for it.
ggplot(data = filter(CO2, Treatment == 'chilled'), mapping = aes(x = uptake)) + geom_histogram()
As before, it’s a little ugly and hard to see, but there are simple tweaks we can make
ggplot(data = filter(CO2, Treatment == 'chilled'), mapping = aes(x = uptake)) + geom_histogram(binwidth = 1, color = "white")
Here binwidth and color make for an easier-to-read, more information-rich plot.
We have more than one treatment though! Now what?
ggplot(data = CO2, mapping = aes(x = uptake)) + geom_histogram(binwidth = 1, color = "white") + facet_wrap(~Treatment, nrow = 2)
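To make the role of binwidth concrete, here is a base R sketch that bins the chilled uptake values by hand into bins one unit wide and counts the observations in each. The exact bin edges are an assumption chosen for this example; geom_histogram() may align its bins differently, so individual bar heights can differ slightly from these counts.

```r
# Uptake values for the chilled treatment only
chilled_uptake <- CO2$uptake[CO2$Treatment == "chilled"]

# Bin edges one unit apart, covering the full range of the data
bin_edges <- seq(floor(min(chilled_uptake)), ceiling(max(chilled_uptake)), by = 1)

# Count observations per bin without drawing anything
bins <- hist(chilled_uptake, breaks = bin_edges, plot = FALSE)

bin_table <- data.frame(
  lower = head(bin_edges, -1),
  upper = bin_edges[-1],
  count = bins$counts
)

bin_table
```

A wider binwidth would merge neighbouring rows of this table, giving a smoother but less detailed histogram.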
Barplots
Specified via geom_bar() or geom_col(), depending on the data.
Use geom_col() when data are ‘pre-counted’, otherwise geom_bar(). Let’s try both.
Raw:
insects
## # A tibble: 5 × 1
##   order
##   <chr>
## 1 Lepidoptera
## 2 Lepidoptera
## 3 Orthoptera
## 4 Orthoptera
## 5 Hymenoptera
Pre-counted:
insects_counted
## # A tibble: 3 × 2
##   order       count
##   <chr>       <dbl>
## 1 Lepidoptera     2
## 2 Orthoptera      2
## 3 Hymenoptera     1
ggplot(insects,
aes(x = order)) +
geom_bar()
ggplot(insects_counted,
aes(x = order, y = count)) +
geom_col()
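For anyone wanting to reproduce the two plots, here is one possible reconstruction of the example data. The slides do not show how insects and insects_counted were created, so the construction below (plain data frames, with base R's table() doing the counting) is an assumption.

```r
library(ggplot2)

# Raw data: one row per observed insect
insects <- data.frame(
  order = c("Lepidoptera", "Lepidoptera", "Orthoptera", "Orthoptera", "Hymenoptera")
)

# Pre-counted data: one row per order (table() sorts orders alphabetically)
insects_counted <- as.data.frame(table(order = insects$order), responseName = "count")

# geom_bar() does the counting itself, so it only needs x
bar_from_raw <- ggplot(insects, aes(x = order)) +
  geom_bar()

# geom_col() plots the heights it is given, so it needs x and y
bar_from_counts <- ggplot(insects_counted, aes(x = order, y = count)) +
  geom_col()

bar_from_counts
```

Both plots should show the same three bars; the only difference is whether ggplot2 or you did the counting.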
Final tricks
patchwork
library(patchwork) makes it simple to combine multiple plots into multi-panel figures.
- Save your plots as objects using <-
- Use + and / to combine them: + combines horizontally, / vertically, and () allows more complex arrangements.
plot_1 <-
ggplot(data = CO2, mapping = aes(x = Treatment, y = uptake)) +
geom_boxplot(width = 0.3) +
theme_classic()
plot_2 <-
ggplot(data = CO2, mapping = aes(x = Type, y = conc)) +
geom_boxplot(width = 0.3) +
theme_classic()
plot_1 + plot_2
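A slightly more complex arrangement might look like the sketch below, which places two plots side by side above a third and labels the panels. The third plot and the panel tags (via plot_annotation()) are additions for illustration, and the plot names here are hypothetical.

```r
library(ggplot2)
library(patchwork)

top_left <-
  ggplot(data = CO2, mapping = aes(x = Treatment, y = uptake)) +
  geom_boxplot(width = 0.3) +
  theme_classic()

top_right <-
  ggplot(data = CO2, mapping = aes(x = Type, y = uptake)) +
  geom_boxplot(width = 0.3) +
  theme_classic()

bottom <-
  ggplot(data = CO2, mapping = aes(x = conc, y = uptake, group = Plant)) +
  geom_line(alpha = 0.5) +
  theme_classic()

# () groups the top row, / stacks it above the bottom plot,
# and plot_annotation() tags the panels A, B, C
combined <- (top_left + top_right) / bottom +
  plot_annotation(tag_levels = "A")

combined
```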
ggsave()
ggsave() saves your ggplot2 plots, and makes some sane guesses about what you want. By default, it’ll save the last plot you displayed, or a saved plot can be specified.
- ggsave('my_barplot.png'): creates a local .png file called my_barplot with your last plot in it
- ggsave('my_barplot.jpg', insect_barplot): creates a .jpg file called my_barplot, containing a plot saved in the object insect_barplot
- ggsave('my_barplot.tiff', width = 10, height = 10, units = 'cm'): creates a 10 cm x 10 cm plot in .tiff format, with your last plot in it
- See ?ggsave for options.
Use ggplot2 to create and modify common forms of data visualisation
Thanks!